Skip to content

Conversation

@tlrx
Copy link
Member

@tlrx tlrx commented Mar 7, 2025

Today RefreshListener implementations can access the current engine reference during the reader(s) refresh by using the IndexShard#getEngineOrNull() method (or another method that ultimately calls it).

In a future change we'd like to guard the mutation of shard's engine using a read/write lock instead of the actual engineMutex object monitor. It means that closing or resetting the engine will be done while holding a write lock and should be able to complete without depending on some other thread doing some work requiring a read lock, otherwise the mutation is at risk to deadlock. Such a deadlock has been identified within RefreshListener that access the engine while holding the refreshLock.

This change introduces an EngineAwarRefreshListener interface that can be implemented by refresh listeners to be notified when a new engine is created. This way refresh listeners already know the current engine instance at the time of the reader is refreshed, and therefore do not need to call getEngineOrNull().

To ensure that refresh listeners are not accessing getEngineOrNull() during refreshes, they are now wrapped into an AssertingRefreshListener when they are added to the reader reference manager. AssertingRefreshListener uses a SafeEngineAccessThreadLocal to set a thread-local flag before executing the beforeRefresh() and afterRefresh() methods and resets the flag after the methods are executed. Finally, the flag is checked when the getEngineOrNull() is executed, throwing if it is set and printing the stack trace.

@tlrx tlrx added >non-issue :Distributed Indexing/Engine Anything around managing Lucene and the Translog in an open shard. v9.1.0 labels Mar 7, 2025
@elasticsearchmachine elasticsearchmachine added the serverless-linked Added by automation, don't add manually label Mar 7, 2025
boolean forced,
@Nullable TimeValue postWriteRefreshTimeout
) {
Engine engineOrNull = indexShard.getEngineOrNull();
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This invocation has been caught by the new SafeEngineAccessThreadLocal

ReaderContext readerContext = null;
try {
searcherSupplier = shard.acquireSearcherSupplier();
searcherSupplier = shard.acquireSearcherSupplier(); // TODO I think we need to fork here
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By reading the code it invokes getEngine and therefore might cause the new assertion to fail on CI.

It's likely that we don't have many tests using PITs and that's why I haven't seen it yet, but let's see if that fails.

@tlrx
Copy link
Member Author

tlrx commented Mar 7, 2025

This one is going to be complex to fix:

[2025-03-07T18:43:45,511][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [test-cluster-0] fatal error in thread [elasticsearch[test-cluster-0][refresh][T#3]], exiting
java.lang.AssertionError: Current thread has made an unsafe access to the engine
	at org.elasticsearch.index.engine.SafeEngineAccessThreadLocal.getAccessorSafe(SafeEngineAccessThreadLocal.java:63) ~[elasticsearch-9.1.0-SNAPSHOT.jar:?]
	at org.elasticsearch.index.engine.SafeEngineAccessThreadLocal.accessEnd(SafeEngineAccessThreadLocal.java:77) ~[elasticsearch-9.1.0-SNAPSHOT.jar:?]
	at org.elasticsearch.index.engine.AbstractReaderManager$AssertingRefreshListener.afterRefresh(AbstractReaderManager.java:100) ~[elasticsearch-9.1.0-SNAPSHOT.jar:?]
	at org.apache.lucene.search.ReferenceManager.notifyRefreshListenersRefreshed(ReferenceManager.java:275) ~[lucene-core-10.1.0.jar:?]
	at org.apache.lucene.search.ReferenceManager.doMaybeRefresh(ReferenceManager.java:182) ~[lucene-core-10.1.0.jar:?]
	at org.apache.lucene.search.ReferenceManager.maybeRefresh(ReferenceManager.java:213) ~[lucene-core-10.1.0.jar:?]
	at org.elasticsearch.index.engine.InternalEngine.refresh(InternalEngine.java:2044) ~[elasticsearch-9.1.0-SNAPSHOT.jar:?]
	at org.elasticsearch.index.engine.InternalEngine.lambda$maybeRefresh$9(InternalEngine.java:2017) ~[elasticsearch-9.1.0-SNAPSHOT.jar:?]
	at org.elasticsearch.action.ActionListener.completeWith(ActionListener.java:367) ~[elasticsearch-9.1.0-SNAPSHOT.jar:?]
	at org.elasticsearch.index.engine.InternalEngine.maybeRefresh(InternalEngine.java:2017) ~[elasticsearch-9.1.0-SNAPSHOT.jar:?]
	at org.elasticsearch.index.shard.IndexShard.lambda$scheduledRefresh$47(IndexShard.java:4056) ~[elasticsearch-9.1.0-SNAPSHOT.jar:?]
	at org.elasticsearch.action.ActionListener.run(ActionListener.java:463) ~[elasticsearch-9.1.0-SNAPSHOT.jar:?]
	at org.elasticsearch.index.shard.IndexShard.scheduledRefresh(IndexShard.java:4037) ~[elasticsearch-9.1.0-SNAPSHOT.jar:?]
	at org.elasticsearch.index.IndexService.maybeRefreshEngine(IndexService.java:1086) ~[elasticsearch-9.1.0-SNAPSHOT.jar:?]
	at org.elasticsearch.index.IndexService$AsyncRefreshTask.runInternal(IndexService.java:1222) ~[elasticsearch-9.1.0-SNAPSHOT.jar:?]
	at org.elasticsearch.common.util.concurrent.AbstractAsyncTask.run(AbstractAsyncTask.java:138) ~[elasticsearch-9.1.0-SNAPSHOT.jar:?]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:977) ~[elasticsearch-9.1.0-SNAPSHOT.jar:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) ~[?:?]
	at java.lang.Thread.run(Thread.java:1575) ~[?:?]
Caused by: java.lang.AssertionError: thread [Thread[#109,elasticsearch[test-cluster-0][refresh][T#3],5,main]] should not access the engine using the getEngineOrNull() method
	at org.elasticsearch.index.engine.SafeEngineAccessThreadLocal.assertNoAccessByCurrentThread(SafeEngineAccessThreadLocal.java:90) ~[elasticsearch-9.1.0-SNAPSHOT.jar:?]
	at org.elasticsearch.index.shard.IndexShard.getEngineOrNull(IndexShard.java:3317) ~[elasticsearch-9.1.0-SNAPSHOT.jar:?]
	at org.elasticsearch.index.shard.IndexShard.getEngine(IndexShard.java:3305) ~[elasticsearch-9.1.0-SNAPSHOT.jar:?]
	at org.elasticsearch.index.shard.IndexShard.getLocalCheckpoint(IndexShard.java:2974) ~[elasticsearch-9.1.0-SNAPSHOT.jar:?]
	at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryShardReference.localCheckpoint(TransportReplicationAction.java:1192) ~[elasticsearch-9.1.0-SNAPSHOT.jar:?]
	at org.elasticsearch.action.support.replication.ReplicationOperation.updateCheckPoints(ReplicationOperation.java:384) ~[elasticsearch-9.1.0-SNAPSHOT.jar:?]
	at org.elasticsearch.action.support.replication.ReplicationOperation$1.onResponse(ReplicationOperation.java:202) ~[elasticsearch-9.1.0-SNAPSHOT.jar:?]
	at org.elasticsearch.action.support.replication.ReplicationOperation$1.onResponse(ReplicationOperation.java:197) ~[elasticsearch-9.1.0-SNAPSHOT.jar:?]
	at org.elasticsearch.action.support.replication.TransportWriteAction$WritePrimaryResult$1.onSuccess(TransportWriteAction.java:341) ~[elasticsearch-9.1.0-SNAPSHOT.jar:?]
	at org.elasticsearch.action.support.replication.TransportWriteAction$AsyncAfterWriteAction.maybeFinish(TransportWriteAction.java:498) ~[elasticsearch-9.1.0-SNAPSHOT.jar:?]
	at org.elasticsearch.action.support.replication.TransportWriteAction$AsyncAfterWriteAction$1.onResponse(TransportWriteAction.java:524) ~[elasticsearch-9.1.0-SNAPSHOT.jar:?]
	at org.elasticsearch.action.support.replication.TransportWriteAction$AsyncAfterWriteAction$1.onResponse(TransportWriteAction.java:516) ~[elasticsearch-9.1.0-SNAPSHOT.jar:?]
	at org.elasticsearch.index.shard.RefreshListeners.lambda$addOrNotify$1(RefreshListeners.java:152) ~[elasticsearch-9.1.0-SNAPSHOT.jar:?]
	at org.elasticsearch.index.shard.RefreshListeners.fireListeners(RefreshListeners.java:432) ~[elasticsearch-9.1.0-SNAPSHOT.jar:?]
	at org.elasticsearch.index.shard.RefreshListeners.afterRefresh(RefreshListeners.java:420) ~[elasticsearch-9.1.0-SNAPSHOT.jar:?]
	at org.elasticsearch.index.engine.AbstractReaderManager$AssertingRefreshListener.afterRefresh(AbstractReaderManager.java:98) ~[elasticsearch-9.1.0-SNAPSHOT.jar:?]
	... 17 more

@tlrx tlrx force-pushed the ensure-safe-engine-access-in-refresh-listeners branch from 26a7719 to 9d51165 Compare March 11, 2025 12:25
@tlrx
Copy link
Member Author

tlrx commented Mar 28, 2025

Implemented in #124635, closing.

@tlrx tlrx closed this Mar 28, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

:Distributed Indexing/Engine Anything around managing Lucene and the Translog in an open shard. >non-issue serverless-linked Added by automation, don't add manually v9.1.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants